Skip to content

Convert build process to saga framework#842

Open
phinze wants to merge 2 commits into
mainfrom
phinze/mir-441-build-process-saga
Open

Convert build process to saga framework#842
phinze wants to merge 2 commits into
mainfrom
phinze/mir-441-build-process-saga

Conversation

@phinze

@phinze phinze commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

This converts the build pipeline (BuildFromTar / BuildFromPrepared) from a ~450-line synchronous function into the saga framework from MIR-439. Eleven actions cover the pipeline, from receiving the source tar through building the image, creating the config and version entities, provisioning addons, and activating the new version, each with a compensating undo so a failure partway through rolls back cleanly instead of leaving half-built state behind.

Two small primitives bridge the saga world (which needs everything serializable for crash recovery) to the non-serializable RPC stream state. StreamRegistry stages the inbound tar to disk so a recovered saga can still find the source after the original io.Reader is gone, and StatusRegistry carries the outbound progress stream, handing back a no-op sender when the saga is recovering and the client is long gone. A SagaBuilder wraps the whole thing as a drop-in implementation of the existing Builder RPC interface, and coordinate.go selects it behind the existing labs.Sagas() flag.

The interesting part is what manual testing turned up. Running a real deploy through the saga path failed at the very end with "execution has status pending, expected completed," even though the build had clearly succeeded: the saga's entire state had frozen at its initial pending write. The cause was in the saga framework itself, not the build code. EntityStorage.Save and EACStorage.Save persisted through create-if-absent primitives, so every save after the first was silently dropped. The sandbox saga from MIR-439 had been hitting this the whole time, it just never noticed because nothing read its state back. The fix makes Save a real upsert. A second, smaller bug fell out of the same testing: each executor's recovery treated another executor's sagas (build and sandbox share storage) as errors, so we scoped recovery to skip definitions an executor doesn't own.

To keep both from regressing, there's a storage conformance suite that runs the persistence scenarios against all three Storage backends. The deeper question this raised, whether the mock store can be trusted to stand in for etcd in the first place, is addressed separately in the entity.Store conformance suite (#847).

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 07e6e5f0-14f3-4901-b0d0-cd93fecc30dc

📥 Commits

Reviewing files that changed from the base of the PR and between e79d406 and ee27cb3.

📒 Files selected for processing (16)
  • components/coordinate/coordinate.go
  • pkg/labs/features.yaml
  • pkg/labs/labs.gen.go
  • pkg/saga/eac_storage.go
  • pkg/saga/executor.go
  • pkg/saga/storage.go
  • pkg/saga/storage_conformance_test.go
  • servers/build/build.go
  • servers/build/build_saga.go
  • servers/build/build_saga_buildkit.go
  • servers/build/build_saga_test.go
  • servers/build/saga_builder.go
  • servers/build/status_registry.go
  • servers/build/status_registry_test.go
  • servers/build/stream_registry.go
  • servers/build/stream_registry_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/labs/features.yaml
🚧 Files skipped from review as they are similar to previous changes (10)
  • components/coordinate/coordinate.go
  • servers/build/status_registry_test.go
  • servers/build/stream_registry_test.go
  • servers/build/stream_registry.go
  • servers/build/build_saga_test.go
  • servers/build/status_registry.go
  • servers/build/saga_builder.go
  • servers/build/build_saga_buildkit.go
  • servers/build/build.go
  • servers/build/build_saga.go

📝 Walkthrough

Walkthrough

This PR adds a saga-orchestrated build pipeline: StreamRegistry and StatusRegistry for staging tar streams and streaming build status; refactors Builder.buildFromDir to use detectBuildStack and runBuildkitBuild; defines a restartable build-from-tar saga with compensating undo actions; provides a SagaBuilder RPC wrapper that manages streams/status and populates results; adds tests for sagas, registries, storage conformance, and coordinator wiring; and conditionally enables the saga builder via the sagas feature flag.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
servers/build/build_saga_test.go (2)

109-124: ⚡ Quick win

Make stub artifact names unique per run.

Using a fixed <app>-stub entity name makes this helper fragile if a test starts the saga more than once against the same in-memory store.

♻️ Proposed fix
-	artifactName := in.AppName + "-stub"
+	artifactName := in.AppName + "-" + in.VersionName + "-stub"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/build/build_saga_test.go` around lines 109 - 124, The helper
currently uses a fixed artifactName ("artifactName := in.AppName + \"-stub\"")
which causes collisions when the saga is started multiple times against the same
in-memory store; change artifactName generation in the build helper to include a
run-unique suffix (e.g., append time.Now().UnixNano() or a generated UUID)
before calling deps.builder.ec.Create so each created entity name is unique and
avoids test interference with artifact creation via deps.builder.ec.Create.

293-309: ⚡ Quick win

Assert compensation by checking entity absence, not only UndoneAt.

This currently verifies saga metadata only. Add direct store checks for the created ConfigVersion/AppVersion entities to ensure compensators actually removed persisted rows.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/build/build_saga_test.go` around lines 293 - 309, The test currently
only checks
exec.ExecutedActions[actionCreateConfigVer|actionCreateVersion].UndoneAt; extend
it to assert the compensators actually removed persisted rows by attempting to
load the corresponding entities (ConfigVersion and AppVersion) from the store
and expecting a "not found" result. After you retrieve exec from storage.Get,
derive the created entity IDs from the execution record or action payload, then
call the store read methods for ConfigVersion and AppVersion (use the project's
existing load/get functions) and assert they return a not-found error (or nil
result) rather than a valid entity; keep these checks adjacent to the existing
UndoneAt assertions for actionCreateConfigVer and actionCreateVersion.
servers/build/build_saga.go (1)

659-689: 💤 Low value

Minor: Unused variable workaround can be removed when migration check is updated.

Line 675 has _ = status to suppress the unused variable warning. The comment at lines 667-672 explains this is a temporary regression. Consider tracking this as technical debt.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/build/build_saga.go` around lines 659 - 689, The _ = status no-op
exists to silence an unused variable in finalize; remove this workaround (delete
the `_ = status` line) and either use status where intended or add a short TODO
comment referencing the pending migration work so the linter won't be suppressed
silently; locate this in the finalize function (symbols: finalize, status,
checkLocalStorageMigration) and ensure no other unused-variable warnings remain
after removal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@servers/build/stream_registry.go`:
- Around line 140-152: The cleanup currently deletes s.staged[streamID] and
s.streams[streamID] while holding s.mu before calling os.RemoveAll, so if
RemoveAll fails the registry loses the staged path; change the flow in the
Cleanup method to first read (but not delete) the staged path under s.mu,
release the lock, call os.RemoveAll(path), then re-acquire s.mu and only delete
s.staged[streamID] and s.streams[streamID] if removal succeeded (or restore/keep
the entry on error). Use the existing s.mu, s.staged, s.streams, and
os.RemoveAll references to locate and modify the logic accordingly.

---

Nitpick comments:
In `@servers/build/build_saga_test.go`:
- Around line 109-124: The helper currently uses a fixed artifactName
("artifactName := in.AppName + \"-stub\"") which causes collisions when the saga
is started multiple times against the same in-memory store; change artifactName
generation in the build helper to include a run-unique suffix (e.g., append
time.Now().UnixNano() or a generated UUID) before calling deps.builder.ec.Create
so each created entity name is unique and avoids test interference with artifact
creation via deps.builder.ec.Create.
- Around line 293-309: The test currently only checks
exec.ExecutedActions[actionCreateConfigVer|actionCreateVersion].UndoneAt; extend
it to assert the compensators actually removed persisted rows by attempting to
load the corresponding entities (ConfigVersion and AppVersion) from the store
and expecting a "not found" result. After you retrieve exec from storage.Get,
derive the created entity IDs from the execution record or action payload, then
call the store read methods for ConfigVersion and AppVersion (use the project's
existing load/get functions) and assert they return a not-found error (or nil
result) rather than a valid entity; keep these checks adjacent to the existing
UndoneAt assertions for actionCreateConfigVer and actionCreateVersion.

In `@servers/build/build_saga.go`:
- Around line 659-689: The _ = status no-op exists to silence an unused variable
in finalize; remove this workaround (delete the `_ = status` line) and either
use status where intended or add a short TODO comment referencing the pending
migration work so the linter won't be suppressed silently; locate this in the
finalize function (symbols: finalize, status, checkLocalStorageMigration) and
ensure no other unused-variable warnings remain after removal.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 23e291f7-2359-4312-9c88-01f1636a9838

📥 Commits

Reviewing files that changed from the base of the PR and between be23ed9 and ccf6430.

📒 Files selected for processing (11)
  • components/coordinate/coordinate.go
  • pkg/labs/features.yaml
  • pkg/labs/labs.gen.go
  • servers/build/build.go
  • servers/build/build_saga.go
  • servers/build/build_saga_test.go
  • servers/build/saga_builder.go
  • servers/build/status_registry.go
  • servers/build/status_registry_test.go
  • servers/build/stream_registry.go
  • servers/build/stream_registry_test.go

Comment thread servers/build/stream_registry.go
@phinze phinze force-pushed the phinze/mir-441-build-process-saga branch from ccf6430 to a435c56 Compare June 4, 2026 14:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@servers/build/stream_registry.go`:
- Around line 85-96: The code deletes s.streams[streamID] before calling
os.MkdirTemp, so if MkdirTemp fails the unread reader is lost; change Stage so
that either (a) you create the temp dir before deleting the entry, or (b) if you
must delete first, on MkdirTemp error reinsert the saved reader back into
s.streams under s.mu to restore state. Concretely: keep the local reader
variable, call os.MkdirTemp, and only delete s.streams[streamID] after
successful dir creation; or if you keep the delete earlier, surround the
MkdirTemp call and on error acquire s.mu and set s.streams[streamID] = reader
before returning the error (use s.mu to protect the map).
- Around line 128-133: MarkStaged currently overwrites any existing staged path
for a streamID, losing the original prepared directory; change
StreamRegistry.MarkStaged so it preserves the first staged path by checking
s.staged for streamID and only setting s.staged[streamID] = path if it does not
already exist, while still removing the entry from s.streams; reference
StreamRegistry.MarkStaged (and behavior consistent with Register/Cleanup) to
locate the method to update.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6862bc7d-155d-4a57-9af9-2bdefc27e080

📥 Commits

Reviewing files that changed from the base of the PR and between ccf6430 and a435c56.

📒 Files selected for processing (11)
  • components/coordinate/coordinate.go
  • pkg/labs/features.yaml
  • pkg/labs/labs.gen.go
  • servers/build/build.go
  • servers/build/build_saga.go
  • servers/build/build_saga_test.go
  • servers/build/saga_builder.go
  • servers/build/status_registry.go
  • servers/build/status_registry_test.go
  • servers/build/stream_registry.go
  • servers/build/stream_registry_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/labs/labs.gen.go
  • pkg/labs/features.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
  • components/coordinate/coordinate.go
  • servers/build/status_registry_test.go
  • servers/build/build_saga_test.go
  • servers/build/build_saga.go
  • servers/build/saga_builder.go
  • servers/build/stream_registry_test.go
  • servers/build/status_registry.go
  • servers/build/build.go

Comment thread servers/build/stream_registry.go
Comment thread servers/build/stream_registry.go
@phinze phinze force-pushed the phinze/mir-441-build-process-saga branch 2 times, most recently from 8f513b0 to e79d406 Compare June 4, 2026 23:36

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
servers/build/build_saga.go (1)

438-442: 💤 Low value

Consider logging a warning if EphemeralExpiresAt fails to parse.

The parse error is silently ignored. While this value comes from handleEphemeral which formats with RFC3339 (so it should never fail), silent failures could mask data corruption issues where ephemeral versions never expire.

Suggested enhancement
 		if in.EphemeralExpiresAt != "" {
-			if t, err := time.Parse(time.RFC3339, in.EphemeralExpiresAt); err == nil {
+			t, err := time.Parse(time.RFC3339, in.EphemeralExpiresAt)
+			if err != nil {
+				b.Log.Warn("failed to parse ephemeral expiration time", "value", in.EphemeralExpiresAt, "error", err)
+			} else {
 				av.EphemeralExpiresAt = t
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/build/build_saga.go` around lines 438 - 442, The parse error for
in.EphemeralExpiresAt is currently ignored; update the time.Parse branch to log
a warning when parsing fails (include the offending in.EphemeralExpiresAt string
and the parse error) while preserving current behavior when parse succeeds (set
av.EphemeralExpiresAt = t). Use the package's logger (or the standard log
package) to emit the warning from the same context where in.EphemeralExpiresAt
and av.EphemeralExpiresAt are handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@servers/build/build_saga.go`:
- Around line 438-442: The parse error for in.EphemeralExpiresAt is currently
ignored; update the time.Parse branch to log a warning when parsing fails
(include the offending in.EphemeralExpiresAt string and the parse error) while
preserving current behavior when parse succeeds (set av.EphemeralExpiresAt = t).
Use the package's logger (or the standard log package) to emit the warning from
the same context where in.EphemeralExpiresAt and av.EphemeralExpiresAt are
handled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2d60c1b9-1aa4-4a0e-a8ac-69b5d4d9f694

📥 Commits

Reviewing files that changed from the base of the PR and between 8f513b0 and e79d406.

📒 Files selected for processing (12)
  • components/coordinate/coordinate.go
  • pkg/labs/features.yaml
  • pkg/labs/labs.gen.go
  • servers/build/build.go
  • servers/build/build_saga.go
  • servers/build/build_saga_buildkit.go
  • servers/build/build_saga_test.go
  • servers/build/saga_builder.go
  • servers/build/status_registry.go
  • servers/build/status_registry_test.go
  • servers/build/stream_registry.go
  • servers/build/stream_registry_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/labs/features.yaml
  • pkg/labs/labs.gen.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • servers/build/status_registry.go
  • components/coordinate/coordinate.go
  • servers/build/status_registry_test.go
  • servers/build/stream_registry_test.go
  • servers/build/build.go
  • servers/build/saga_builder.go
  • servers/build/build_saga_test.go
  • servers/build/stream_registry.go

The build pipeline (BuildFromTar / BuildFromPrepared / buildFromDir)
ran as a 450-line synchronous function whose only crash recovery was
defer-based cleanup that died with the process. A server restart mid
build could leave orphaned ConfigVersion or AppVersion entities, with
no way to either resume the build or compensate the partial state.

Converted to the saga framework from MIR-439, following the sandbox
saga pattern from MIR-440. Each phase of the build becomes a saga
action with an undo; a new StreamRegistry stages incoming tar data
to disk so recovery survives the loss of the original io.Reader, and
a symmetric StatusRegistry handles outbound progress streaming
without coupling actions to the per-request RPC stream. SagaBuilder
implements the Builder RPC interface and coordinate.go selects it
behind the existing labs.Sagas() flag.

Closes MIR-441
@phinze phinze force-pushed the phinze/mir-441-build-process-saga branch from e79d406 to a0d23d4 Compare June 5, 2026 13:03
Converting the build process to sagas (the parent commit) surfaced two
framework bugs that only bite a saga which reads its own state back or
shares storage with another executor. The build saga does both.

EntityStorage.Save and EACStorage.Save persisted through create-if-
absent primitives (EnsureEntity / EAC Ensure), so every save after the
first was silently dropped. A saga's record froze at its initial
pending state: status never advanced to completed, no action results
were recorded. Reading outputs back failed ("status pending, expected
completed"), and on restart recovery re-ran already-finished sagas
because they still looked pending. MemoryStorage overwrites, so unit
tests stayed green while both production backends were broken. Save now
upserts: EntityStorage ensures-then-replaces, EACStorage uses Put.

Recovery treated a saga whose definition isn't in the executor's
registry as an error. But storage is shared across executors (build and
sandbox each run their own) and ListIncomplete returns every executor's
sagas, so each tripped over the other's on startup. Recovery now skips
definitions it doesn't own, leaving them to their owning executor.

A storage conformance suite runs the persistence scenarios against all
three Storage backends so a backend that silently drops updates can't
pass again.
@phinze phinze marked this pull request as ready for review June 5, 2026 16:22
@phinze phinze requested a review from a team as a code owner June 5, 2026 16:22

@evanphx evanphx left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great cleanup!

@@ -0,0 +1,169 @@
package saga

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We picked this up in another PR already I'm pretty sure.

Comment on lines +211 to +238
// prepareConfig assembles the final ConfigSpec for the new version by
// merging build outputs, app.toml, Procfile, and existing app config,
// then runs every blocking validation (services exist, required vars
// have values, node ports are free, disk references resolve). Pure
// computation + entity reads — no side effects, no undo needed.
//
// Validation failures surface as the saga error and a user-facing
// status update. The pre-saga path called validateNodePorts and
// validateDiskConfigs separately; bundling them with the rest in one
// action keeps the saga DAG simple and matches what the user
// experiences as one logical "config prep" step.

type prepareConfigIn struct {
AppName string `json:"app_name" saga:"app_name"`
StreamID string `json:"stream_id" saga:"stream_id"`
AppID string `json:"app_id" saga:"app_id"`
BuildResult *BuildResult `json:"build_result,omitempty" saga:"build_result,optional"`
AppConfig *appconfig.AppConfig `json:"app_config,omitempty" saga:"app_config,optional"`
ProcfileServices map[string]string `json:"procfile_services,omitempty" saga:"procfile_services,optional"`
ExistingConfig string `json:"existing_config_json" saga:"existing_config_json"`
CLIEnvVars []*build_v1alpha.EnvironmentVariable `json:"cli_env_vars,omitempty" saga:"cli_env_vars,optional"`
}

type prepareConfigOut struct {
ConfigSpec string `json:"config_spec_json" saga:"config_spec_json"`
}

func prepareConfig(ctx context.Context, in prepareConfigIn) (prepareConfigOut, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta-point: man these are so good for cleaning up the logic. It's immediately obvious what can influence the config.

Comment on lines +539 to +548
func undoSetActiveVersion(ctx context.Context, in setActiveVersionIn, out setActiveVersionOut) error {
if out.Skipped {
return nil
}
deps := saga.Get[*buildSagaDeps](ctx)
if err := deps.builder.appClient.SetActiveVersion(ctx, in.AppName, out.PreviousVersionID); err != nil {
return fmt.Errorf("restoring previous active version on %s: %w", in.AppName, err)
}
return nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

Comment on lines +21 to +22
// StreamRegistry bridges non-serializable tar streams into the saga world by
// staging them to durable filesystem paths. The pattern, per RFD-35:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good abstraction because I could see us eventually having the ability to consume the tar from other places and this makes that easy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants